44 research outputs found
Deep Multiple Instance Learning for Zero-shot Image Tagging
In line with the success of deep learning on traditional recognition problems,
several end-to-end deep models for zero-shot recognition have been proposed in
the literature. These models successfully predict a single unseen label for an
input image, but they do not scale to cases where multiple unseen objects
are present. In this paper, we model this problem within the framework of
Multiple Instance Learning (MIL). To the best of our knowledge, we propose the
first end-to-end trainable deep MIL framework for the multi-label zero-shot
tagging problem. Due to its novel design, the proposed framework has several
interesting features: (1) Unlike previous deep MIL models, it does not use any
off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation.
(2) During test time, it can process any number of unseen labels given their
semantic embedding vectors. (3) Using only seen labels per image as weak
annotation, it can produce a bounding box for each predicted label. We
experiment with the NUS-WIDE dataset and achieve superior performance across
conventional, zero-shot and generalized zero-shot tagging tasks.
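To make the MIL formulation above concrete, here is a minimal, hedged sketch of how a bag of image regions could be scored against the word vectors of unseen labels; the projection W, feature sizes, and the max-pooling choice are illustrative assumptions, not the paper's exact architecture.

import torch

def zero_shot_tag_scores(region_feats, word_vecs, W):
    """Illustrative MIL-style scoring sketch (not the paper's exact model).

    region_feats: (N, d_v) visual features for the N instances in one image bag.
    word_vecs:    (L, d_s) semantic embeddings of the candidate (possibly unseen) labels.
    W:            (d_v, d_s) assumed projection from visual to semantic space.
    Returns one score per label: each label's best-matching instance in the bag.
    """
    projected = region_feats @ W                        # (N, d_s) map instances to semantic space
    instance_scores = projected @ word_vecs.t()         # (N, L) instance-vs-label similarities
    bag_scores, best_box = instance_scores.max(dim=0)   # MIL max-pooling over instances
    return bag_scores, best_box                         # best_box indexes the responsible region

# Example: 6 candidate regions, 300-d word vectors, 5 unseen tags.
feats = torch.randn(6, 2048)
words = torch.randn(5, 300)
W = torch.randn(2048, 300)
scores, boxes = zero_shot_tag_scores(feats, words, W)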
A Unified approach for Conventional Zero-shot, Generalized Zero-shot and Few-shot Learning
Prevalent techniques in zero-shot learning do not generalize well to other
related problem scenarios. Here, we present a unified approach for conventional
zero-shot, generalized zero-shot and few-shot learning problems. Our approach
is based on a novel Class Adapting Principal Directions (CAPD) concept that
allows multiple embeddings of image features into a semantic space. Given an
image, our method produces one principal direction for each seen class. Then,
it learns how to combine these directions to obtain the principal direction for
each unseen class such that the CAPD of the test image is aligned with the
semantic embedding of the true class, and opposite to the other classes. This
allows efficient and class-adaptive information transfer from seen to unseen
classes. In addition, we propose an automatic process for selection of the most
useful seen classes for each unseen class to achieve robustness in zero-shot
learning. Our method can update the unseen CAPDs by taking advantage of a few
unseen images, allowing it to operate in a few-shot learning scenario. Furthermore, our method
can generalize the seen CAPDs by estimating the seen-unseen diversity, which
significantly improves the performance of generalized zero-shot learning. Our
extensive evaluations demonstrate that the proposed approach consistently
achieves superior performance in zero-shot, generalized zero-shot and
few/one-shot learning problems.
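As a rough illustration of the CAPD idea described above, the hedged sketch below combines seen-class principal directions into an unseen-class direction and classifies a test image by alignment with the semantic embeddings; the mixing weights alpha and all dimensions are assumptions for illustration, not the paper's formulation.

import torch
import torch.nn.functional as F

def unseen_capd(seen_dirs, alpha):
    """Combine seen-class principal directions into per-unseen-class CAPDs.

    seen_dirs: (S, d) one principal direction per seen class for this image.
    alpha:     (U, S) assumed mixing weights (e.g., derived from seen/unseen similarity).
    Returns (U, d): one CAPD per unseen class.
    """
    return alpha @ seen_dirs

def classify(capds, class_embeddings):
    """Pick the unseen class whose semantic embedding is best aligned with its CAPD."""
    sims = F.cosine_similarity(capds, class_embeddings, dim=1)   # (U,)
    return sims.argmax().item()

# Example with 10 seen classes, 4 unseen classes, and a 300-d semantic space.
seen_dirs = torch.randn(10, 300)
alpha = torch.softmax(torch.randn(4, 10), dim=1)
emb = torch.randn(4, 300)
pred = classify(unseen_capd(seen_dirs, alpha), emb)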
Task-generalizable Adversarial Attack based on Perceptual Metric
Deep neural networks (DNNs) can be easily fooled by adding human-imperceptible
perturbations to images. These perturbed images are known as
`adversarial examples' and pose a serious threat to security and safety
critical systems. A litmus test for the strength of adversarial examples is
their transferability across different DNN models in a black-box setting (i.e.,
when the target model's architecture and parameters are not known to the attacker).
Current attack algorithms that seek to enhance adversarial transferability work
at the decision level, i.e., they generate perturbations that alter the network's
decisions. This leads to two key limitations: (a) An attack is dependent on the
task-specific loss function (e.g. softmax cross-entropy for object recognition)
and therefore does not generalize beyond its original task. (b) The adversarial
examples are specific to the network architecture and demonstrate poor
transferability to other network architectures. We propose a novel approach to
create adversarial examples that can broadly fool different networks on
multiple tasks. Our approach is based on the following intuition: "Perceptual
metrics based on neural network features are highly generalizable and show
excellent performance in measuring and stabilizing input distortions. Therefore
an ideal attack that creates maximum distortions in the network feature space
should realize highly transferable examples". We report extensive experiments
to show how adversarial examples generalize across multiple networks for
classification, object detection and segmentation tasks.
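The feature-space intuition can be illustrated with a short, hedged sketch of an iterative attack that maximizes distortion of an intermediate representation instead of a task-specific loss; the choice of a VGG-16 block as feature extractor, the step size, and the budget are assumptions, not the paper's exact settings.

import torch
import torchvision.models as models

def feature_distortion_attack(x, feature_fn, eps=8 / 255, steps=10, alpha=2 / 255):
    """Sketch of a task-agnostic attack: maximize distortion in feature space.

    x:          (B, 3, H, W) clean images in [0, 1].
    feature_fn: maps images to an intermediate feature map (no task loss involved).
    The perturbation is kept inside an L-infinity ball of radius eps.
    """
    clean_feat = feature_fn(x).detach()
    # Random start inside the eps-ball so the distortion gradient is non-zero at step 0.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = (feature_fn(x_adv) - clean_feat).pow(2).sum()   # feature-space distortion
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                # ascend on the distortion
            x_adv = x + (x_adv - x).clamp(-eps, eps)           # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Example: a mid-level VGG-16 block as the assumed feature extractor
# (pretrained weights would normally be loaded; omitted here to keep the sketch offline).
vgg = models.vgg16(weights=None).features[:16].eval()
adv = feature_distortion_attack(torch.rand(1, 3, 224, 224), vgg)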
Zero-shot Learning of 3D Point Cloud Objects
Recent deep learning architectures can recognize instances of 3D point cloud
objects of previously seen classes quite well. At the same time, current 3D
depth camera technology allows generating/segmenting a large number of 3D point
cloud objects from an arbitrary scene, for which there is no previously seen
training data. A challenge for a 3D point cloud recognition system is, then, to
classify objects from new, unseen classes. This issue can be resolved by
adopting a zero-shot learning (ZSL) approach for 3D data, similar to the 2D
image version of the same problem. ZSL attempts to classify unseen objects by
comparing semantic information (attribute/word vector) of seen and unseen
classes. Here, we adapt several recent 3D point cloud recognition systems to
the ZSL setting with some changes to their architectures. To the best of our
knowledge, this is the first attempt to classify unseen 3D point cloud objects
in the ZSL setting. A standard protocol (which includes the choice of datasets
and the seen/unseen split) to evaluate such systems is also proposed. Baseline
performances are reported using the new protocol on the investigated models.
This investigation poses a new challenge to the 3D point cloud recognition
community that may instigate numerous future works.
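For context, the hedged sketch below shows the standard inductive ZSL recipe such adapted systems typically follow: extract a global point cloud feature, project it into the word-vector space, and assign the nearest unseen class embedding. The toy backbone, class PointCloudZSL, and all dimensions are placeholders, not any of the evaluated architectures.

import torch
import torch.nn as nn

class PointCloudZSL(nn.Module):
    """Sketch: a point cloud backbone projected into a semantic (word-vector) space."""

    def __init__(self, feat_dim=1024, sem_dim=300):
        super().__init__()
        # Stand-in for a PointNet-style backbone: shared per-point MLP + max pooling.
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.project = nn.Linear(feat_dim, sem_dim)    # map the global feature to semantic space

    def forward(self, points):                         # points: (B, N, 3)
        per_point = self.point_mlp(points)             # (B, N, feat_dim)
        global_feat = per_point.max(dim=1).values      # symmetric pooling over the point set
        return self.project(global_feat)               # (B, sem_dim)

def predict_unseen(model, points, unseen_word_vecs):
    """Assign each cloud to the unseen class with the closest semantic embedding."""
    sem = model(points)                                # (B, sem_dim)
    dists = torch.cdist(sem, unseen_word_vecs)         # (B, U)
    return dists.argmin(dim=1)

# Example: 2 clouds of 1024 points each, 5 unseen classes with 300-d word vectors.
model = PointCloudZSL()
preds = predict_unseen(model, torch.randn(2, 1024, 3), torch.randn(5, 300))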
Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts
Current Zero-Shot Learning (ZSL) approaches are restricted to recognition of
a single dominant unseen object category in a test image. We hypothesize that
this setting is ill-suited for real-world applications where unseen objects
appear only as a part of a complex scene, warranting both the `recognition' and
`localization' of an unseen category. To address this limitation, we introduce
a new \emph{`Zero-Shot Detection'} (ZSD) problem setting, which aims at
simultaneously recognizing and locating object instances belonging to novel
categories without any training examples. We also propose a new experimental
protocol for ZSD based on the highly challenging ILSVRC dataset, taking into
account practical issues such as the rarity of unseen objects. To the best of our
knowledge, we propose the first end-to-end deep network for ZSD, one that jointly
models the interplay between visual and semantic domain information. To
overcome the noise in the automatically derived semantic descriptions, we
utilize the concept of meta-classes to design an original loss function that
achieves synergy between max-margin class separation and semantic space
clustering. Furthermore, we present a baseline approach extended from the
recognition to the detection setting. Our extensive experiments show a significant
performance boost over the baseline on the imperative yet difficult ZSD
problem.
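As a rough, hedged illustration of combining max-margin class separation with semantic-space clustering via meta-classes, the sketch below mixes a hinge term over class scores with a pull toward each sample's meta-class centroid; the specific form, the function metaclass_loss, and the weighting lam are assumptions rather than the paper's loss.

import torch
import torch.nn.functional as F

def metaclass_loss(feat, label, word_vecs, meta_of_class, margin=0.2, lam=0.5):
    """Sketch of a loss mixing max-margin separation with meta-class clustering.

    feat:          (B, d) region features already projected into the semantic space.
    label:         (B,)   ground-truth class index per region.
    word_vecs:     (C, d) semantic embedding of every class.
    meta_of_class: (C,)   meta-class index of every class (grouped semantic descriptions).
    """
    scores = feat @ word_vecs.t()                          # (B, C)
    true_score = scores.gather(1, label[:, None])          # (B, 1)
    hinge = F.relu(margin + scores - true_score)           # margin violations vs. every class
    hinge = hinge.scatter(1, label[:, None], 0.0)          # the true class itself is exempt
    separation = hinge.mean()

    # Clustering: pull each feature toward the centroid of its meta-class embeddings.
    n_meta = int(meta_of_class.max()) + 1
    centroids = torch.stack([word_vecs[meta_of_class == m].mean(0) for m in range(n_meta)])
    cluster = (feat - centroids[meta_of_class[label]]).pow(2).sum(1).mean()
    return separation + lam * cluster

# Example: 8 regions, 20 classes grouped into 4 meta-classes.
labels = torch.randint(0, 20, (8,))
loss = metaclass_loss(torch.randn(8, 300), labels, torch.randn(20, 300), torch.arange(20) % 4)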
Mitigating the Hubness Problem for Zero-Shot Learning of 3D Objects
The development of advanced 3D sensors has enabled many objects to be
captured in the wild at a large scale, and a 3D object recognition system may
therefore encounter many objects for which the system has received no training.
Zero-Shot Learning (ZSL) approaches can assist such systems in recognizing
previously unseen objects. Applying ZSL to 3D point cloud objects is an
emerging topic in the area of 3D vision; however, a significant problem that
ZSL often suffers from is the so-called hubness problem, which is when a model
is biased to predict only a few particular labels for most of the test
instances. We observe that this hubness problem is even more severe for 3D
recognition than for 2D recognition. One reason for this is that in 2D one can
use pre-trained networks trained on large datasets like ImageNet, which
produce high-quality features. However, in the 3D case there are no such
large-scale, labelled datasets available for pre-training, which means that the
extracted 3D features are of poorer quality, and this, in turn, exacerbates the
hubness problem. In this paper, we therefore propose a loss to specifically
address the hubness problem. Our proposed method is effective for both
Zero-Shot and Generalized Zero-Shot Learning, and we perform extensive
evaluations on the challenging datasets ModelNet40, ModelNet10, McGill and
SHREC2015. A new state-of-the-art result for both zero-shot tasks in the 3D
case is established.
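The hubness effect described above can be quantified with a short diagnostic: count how often each candidate class is the nearest neighbour of the projected test features and measure the skewness of those counts. This only illustrates the problem being targeted; it is not the proposed loss, and the function name hubness_skewness is introduced here for illustration.

import torch

def hubness_skewness(test_feats, class_embeddings):
    """Quantify hubness: skewness of how often each class is the 1-nearest neighbour.

    test_feats:       (N, d) projected test features in the semantic space.
    class_embeddings: (C, d) candidate (unseen) class embeddings.
    A large positive skew means a few 'hub' classes absorb most predictions.
    """
    nearest = torch.cdist(test_feats, class_embeddings).argmin(dim=1)          # (N,)
    counts = torch.bincount(nearest, minlength=class_embeddings.size(0)).float()
    centred = counts - counts.mean()
    return centred.pow(3).mean() / (centred.pow(2).mean().pow(1.5) + 1e-8)

# Example: 500 test features against 10 unseen classes.
skew = hubness_skewness(torch.randn(500, 300), torch.randn(10, 300))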
Transductive Zero-Shot Learning for 3D Point Cloud Classification
Zero-shot learning, the task of learning to recognize new classes not seen
during training, has received considerable attention in the case of 2D image
classification. However, despite the increasing ubiquity of 3D sensors, the
corresponding 3D point cloud classification problem has not been meaningfully
explored and introduces new challenges. This paper extends, for the first time,
transductive Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL)
approaches to the domain of 3D point cloud classification. To this end, a novel
triplet loss is developed that takes advantage of unlabeled test data. While
designed for the task of 3D point cloud classification, the method is also
shown to be applicable to the more common use-case of 2D image classification.
An extensive set of experiments is carried out, establishing state-of-the-art
results for ZSL and GZSL in the 3D point cloud domain, as well as demonstrating the
applicability of the approach to the image domain.
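A hedged sketch of the transductive idea follows: a triplet-style loss on unlabeled test features that pulls each feature toward its nearest unseen-class prototype (used as a pseudo-label) and pushes it away from the closest seen-class prototype. The exact triplet construction in the paper may differ; prototype choices and the margin below are assumptions.

import torch
import torch.nn.functional as F

def transductive_triplet_loss(unlabeled_feats, unseen_protos, seen_protos, margin=1.0):
    """Sketch of a transductive triplet loss over unlabeled test features.

    Anchor:   an unlabeled (projected) test feature.
    Positive: its nearest unseen-class prototype, serving as a pseudo-label.
    Negative: the closest seen-class prototype, pushed away by the margin.
    """
    d_unseen = torch.cdist(unlabeled_feats, unseen_protos)    # (N, U)
    d_seen = torch.cdist(unlabeled_feats, seen_protos)        # (N, S)
    pos = d_unseen.min(dim=1).values                          # pull toward the pseudo-label
    neg = d_seen.min(dim=1).values                            # push away from seen classes
    return F.relu(pos - neg + margin).mean()

# Example: 32 unlabeled features, 10 unseen and 40 seen class prototypes.
loss = transductive_triplet_loss(torch.randn(32, 300), torch.randn(10, 300), torch.randn(40, 300))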
Any-Shot Object Detection
Previous work on novel object detection considers zero-shot or few-shot settings
where no or only a few examples of each category are available for training. In
real-world scenarios, it is less practical to expect that 'all' the novel classes
are either unseen or have only a few examples. Here, we propose a more realistic
setting termed 'Any-shot detection', where totally unseen and few-shot
categories can simultaneously co-occur during inference. Any-shot detection
offers unique challenges compared to conventional novel object detection such
as a high imbalance between unseen, few-shot and seen object classes,
susceptibility to forgetting base training while learning novel classes, and the
need to distinguish novel classes from the background. To address these challenges,
we propose a unified any-shot detection model that can concurrently learn to
detect both zero-shot and few-shot object classes. Our core idea is to use
class semantics as prototypes for object detection, a formulation that
naturally minimizes knowledge forgetting and mitigates the class-imbalance in
the label space. In addition, we propose a rebalanced loss function that emphasizes
difficult few-shot cases but avoids overfitting on the novel classes to allow
detection of totally unseen classes. Without bells and whistles, our framework
can also be used solely for Zero-shot detection and Few-shot detection tasks.
We report extensive experiments on Pascal VOC and MS-COCO datasets where our
approach is shown to provide significant improvements.
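The prototype idea can be sketched as scoring region features against fixed class-semantic prototypes, with a focal-style rebalancing that down-weights easy (mostly seen/base) examples. Both pieces below are hedged illustrations under assumed scaling and gamma values, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def prototype_logits(region_feats, class_semantics, scale=10.0):
    """Score regions against fixed semantic prototypes instead of learned class weights."""
    feats = F.normalize(region_feats, dim=1)
    protos = F.normalize(class_semantics, dim=1)
    return scale * feats @ protos.t()                         # (B, C) scaled cosine logits

def rebalanced_loss(logits, target, gamma=2.0):
    """Focal-style rebalancing: down-weight well-classified examples."""
    log_p = F.log_softmax(logits, dim=1)
    log_p_true = log_p.gather(1, target[:, None]).squeeze(1)
    p_true = log_p_true.exp()
    return (-(1 - p_true).pow(gamma) * log_p_true).mean()

# Example: 16 region features, 25 classes (seen, few-shot and unseen share the same space).
logits = prototype_logits(torch.randn(16, 300), torch.randn(25, 300))
loss = rebalanced_loss(logits, torch.randint(0, 25, (16,)))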
RealSmileNet: A Deep End-To-End Network for Spontaneous and Posed Smile Recognition
Smiles play a vital role in the understanding of social interactions within
different communities, and reveal the physical state of mind of people in both
real and deceptive ways. Several methods have been proposed to recognize
spontaneous and posed smiles. All of them follow a feature-engineering-based pipeline
requiring costly pre-processing steps such as manual annotation of face
landmarks, tracking, segmentation of smile phases, and hand-crafted features.
The resulting computation is expensive, and strongly dependent on
pre-processing steps. We investigate an end-to-end deep learning model to
address these problems, the first end-to-end model for spontaneous and posed
smile recognition. Our fully automated model is fast and learns the feature
extraction process by training a series of convolutional and ConvLSTM layers
from scratch. Our experiments on four datasets demonstrate the robustness and
generalization of the proposed model by achieving state-of-the-art
performance.
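For context, a minimal convolution-plus-ConvLSTM pipeline for clip-level classification looks roughly like the hedged sketch below; the layer sizes, depths, and the class name SmileNetSketch are illustrative, not RealSmileNet's actual configuration.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A standard ConvLSTM cell: LSTM gates computed with convolutions over feature maps."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class SmileNetSketch(nn.Module):
    """Sketch: per-frame convolutions feeding a ConvLSTM, then a 2-way classifier."""

    def __init__(self):
        super().__init__()
        self.frame_conv = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.convlstm = ConvLSTMCell(16, 32)
        self.head = nn.Linear(32, 2)                      # spontaneous vs. posed

    def forward(self, video):                             # video: (B, T, 3, H, W)
        B, T, _, H, W = video.shape
        h = video.new_zeros(B, 32, H // 2, W // 2)
        c = torch.zeros_like(h)
        for t in range(T):                                # recurrent pass over the frames
            h, c = self.convlstm(self.frame_conv(video[:, t]), h, c)
        return self.head(h.mean(dim=(2, 3)))              # pool the last hidden state

# Example: a batch of 2 clips, 8 frames of 64x64 RGB each.
logits = SmileNetSketch()(torch.rand(2, 8, 3, 64, 64))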
S2FGAN: Semantically Aware Interactive Sketch-to-Face Translation
Interactive facial image manipulation attempts to edit single and multiple
face attributes using a photo-realistic face and/or semantic mask as input. In
the absence of the photo-realistic image (only sketch/mask available), previous
methods can only retrieve the original face and ignore the potential for improving
model controllability and diversity in the translation process. This paper
proposes a sketch-to-image generation framework called S2FGAN, aiming to
improve users' control and flexibility in face attribute editing
from a simple sketch. The proposed framework modifies the constrained latent
space semantics trained on Generative Adversarial Networks (GANs). We employ
two latent spaces to control the face appearance and adjust the desired
attributes of the generated face. Instead of constraining the translation
process by using a reference image, users can instruct the model to retouch
the generated images by incorporating semantic information into the generation
process. In this way, our method can manipulate single or multiple face
attributes by only specifying attributes to be changed. Extensive experimental
results on the CelebAMask-HQ dataset demonstrate the superior performance and
effectiveness of our method on this task. Our method outperforms
state-of-the-art methods on attribute manipulation by exploiting greater
control of attribute intensity.
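As a generic, hedged illustration of the latent-space attribute editing that such frameworks build on, the sketch below shifts a latent code along per-attribute directions with user-chosen intensities; the directions, dimensions, intensity scaling, and the function edit_attributes are assumptions, not S2FGAN's internals.

import torch

def edit_attributes(z, attr_directions, intensities):
    """Generic sketch of latent-space attribute editing.

    z:               (B, d) latent code controlling the generated face.
    attr_directions: (A, d) one learned direction per editable attribute.
    intensities:     (B, A) signed strength per attribute (0 leaves it untouched).
    """
    return z + intensities @ attr_directions

# Example: strengthen attribute 0 and weaken attribute 2 for a single latent code.
z = torch.randn(1, 512)
dirs = torch.randn(5, 512)
strength = torch.tensor([[1.5, 0.0, -1.0, 0.0, 0.0]])
z_edit = edit_attributes(z, dirs, strength)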